feat(agent): classify tool-call failures + surface on web event — rework of #4413 (backend)#4445
Conversation
|
Warning Review limit reachedYou’ve reached a temporary PR review limit under our Fair Usage Limits Policy. Next review available in: 22 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (17)
📝 WalkthroughWalkthroughIntroduces a new ChangesTool failure classification pipeline
Estimated code review effort: 3 (Moderate) | ~30 minutes Sequence Diagram(s)sequenceDiagram
participant Middleware as ToolOutcomeCaptureMiddleware
participant FailureMap as ToolFailureMap
participant Bridge as OpenhumanEventBridge
participant ProgressBridge as spawn_progress_bridge
participant Socket as WebChannelEvent
Middleware->>Middleware: after_tool detects result.error
Middleware->>Middleware: classify error via tool_status::classify/describe
Middleware->>FailureMap: insert(call_id, (success, failure))
Bridge->>FailureMap: remove(call_id) on ToolCompleted
FailureMap-->>Bridge: (success, failure) or absent
Bridge->>Bridge: build AgentProgress::ToolCallCompleted{success, failure}
Bridge->>ProgressBridge: emit AgentProgress event
ProgressBridge->>ProgressBridge: serde_json::to_value(failure) -> failure_json
ProgressBridge->>Socket: emit tool_result{failure: failure_json}
Suggested labels: Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 96b4414dcb
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| "403", | ||
| "forbidden", |
There was a problem hiding this comment.
Don't map external 403s to policy blocks
When an integration or API-backed tool returns a normal authorization failure such as 403 insufficient authentication scopes (this exact Gmail/Composio case is exercised in tests/composio_raw_coverage_e2e.rs), these broad needles classify it as BlockedByPolicy, so the UI tells the user to change Agent access settings and marks it non-recoverable instead of prompting them to reconnect/grant scopes. Please reserve BlockedByPolicy for OpenHuman policy-specific markers/text and classify bare HTTP 403/Forbidden as credentials or user-permission related.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/openhuman/tinyagents/middleware.rs (1)
1399-1409: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winTimeout detection via string match on
"timed out"is fragile.Coupling classification to a literal substring in
contentwill silently stop matching if the message that produces it is ever reworded elsewhere in the codebase. A structuredtimed_out: boolcarried on the tool result (or a dedicated error variant) would be more robust than text sniffing, though the current heuristic is explicitly called out as intentional in the comment.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/tinyagents/middleware.rs` around lines 1399 - 1409, The timeout classification in the middleware is currently inferred by matching the literal substring "timed out" in result.content, which is brittle. Update the logic in tinyagents::middleware::classify to use a structured timeout indicator from the tool result (for example a timed_out flag or dedicated error variant) instead of text sniffing, while keeping the existing POLICY_BLOCKED_MARKER handling intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/tool_status/ops.rs`:
- Around line 60-75: The numeric status-code needles in the ToolFailureClass
detection logic can false-positive because contains_any does plain substring
matching; update the matching in the relevant classification branches (including
the BlockedByPolicy, BadCredentials, and ServiceUnavailable checks in the ops.rs
failure classifier) so codes like 403, 401, 502, 503, and 504 only match when
not surrounded by other digits. Use a boundary-aware check or equivalent guard
before classifying, while keeping the existing keyword-based needles as-is.
---
Nitpick comments:
In `@src/openhuman/tinyagents/middleware.rs`:
- Around line 1399-1409: The timeout classification in the middleware is
currently inferred by matching the literal substring "timed out" in
result.content, which is brittle. Update the logic in
tinyagents::middleware::classify to use a structured timeout indicator from the
tool result (for example a timed_out flag or dedicated error variant) instead of
text sniffing, while keeping the existing POLICY_BLOCKED_MARKER handling intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bf1c9617-ea91-4bb0-8793-d408a4150acc
📒 Files selected for processing (16)
src/bin/harness_subagent_audit.rssrc/core/socketio.rssrc/openhuman/agent/harness/session/tool_progress.rssrc/openhuman/agent/progress.rssrc/openhuman/agent/progress_tracing/tests.rssrc/openhuman/channels/proactive.rssrc/openhuman/channels/providers/presentation.rssrc/openhuman/channels/providers/web/progress_bridge.rssrc/openhuman/mod.rssrc/openhuman/threads/turn_state/mirror_tests.rssrc/openhuman/tinyagents/middleware.rssrc/openhuman/tinyagents/mod.rssrc/openhuman/tinyagents/observability.rssrc/openhuman/tool_status/mod.rssrc/openhuman/tool_status/ops.rssrc/openhuman/tool_status/types.rs
…ent (tinyhumansai#4254, backend) Rework of tinyhumansai#4413's backend onto post-tinyhumansai#4399 code. tinyhumansai#4399 deleted the original classification hook (session/agent_tool_exec.rs) and the ProgressReporter trait; the tool loop now runs on the tinyagents crate, so the hook is re-homed on the adapter seam. - New `src/openhuman/tool_status/` module: `classify(error_text, timed_out)` and `describe(class)` map a tool outcome to a `ClassifiedFailure` (class + category + plain-language cause + next action). Pure, unit-tested (16 tests: all classes, timeout/unknown/precedence, category mapping, recoverability). - `AgentProgress::ToolCallCompleted` + `WebChannelEvent` gain a `failure` field carrying that classification to the chat "View processing" timeline. - Core hook + latent-bug fix: the crate's `AgentEvent::ToolCompleted` carries only call_id + tool_name, so the bridge previously hardcoded `success: true` (a failed tool never surfaced). `ToolOutcomeCaptureMiddleware::after_tool` now classifies each result and writes a `call_id → (success, failure)` side-channel (mirroring the existing `ToolNameMap`); the bridge reads it to emit real `success` + `failure`. Absent entry falls back to `(true, None)`. - `progress_bridge` serializes `failure` onto the `tool_result` socket event and the run ledger. Backend only. The frontend consumer (chat timeline "why / next" render) and i18n keys are a companion — they rebase cleanly onto current `main` from tinyhumansai#4413 (only the Rust hook conflicted with tinyhumansai#4399), so they are not re-included here to avoid reverting unrelated locale drift. NOTE: the middleware-before-event ordering that populates the side-channel is crate-internal; the fallback keeps behavior no worse than before if the event is ever projected first. Needs runtime validation once tinyhumansai#4442 lands (drive a deliberately-failing tool and confirm `success:false` + `failure` on the wire).
96b4414 to
487be7e
Compare
…tatus needles Addresses CodeRabbit review on tinyhumansai#4445: - Bare HTTP 403/Forbidden (e.g. Gmail/Composio "403 insufficient authentication scopes") was classified as BlockedByPolicy, telling the user to change Agent-access settings and marking it non-recoverable. It's an external authz failure → now BadCredentials (reconnect / grant scopes). OpenHuman's own policy blocks are tagged upstream with POLICY_BLOCKED_MARKER (never reach this heuristic); the OpenHuman-specific 'forbidden path' marker stays policy, checked before credentials so it still wins. - contains_any did unbounded substring matching, so '403'/'401'/'50x' false-matched inside longer digit runs (a '403' in '14033', a port, a byte count, a timestamp). New contains_code() requires non-digit boundaries on each side. - Regression tests: external 403/401 → credentials, 'forbidden path' → policy, embedded 403/503 in longer numbers → Unknown, standalone 503 → service.
…ed in tinyhumansai#4445 tinyhumansai#4413 originally shipped both the tool-failure classification backend and the chat "View processing" render. The backend conflicted with tinyhumansai#4399 (harness migration) and has been reworked separately in tinyhumansai#4445 (tool_status module + ToolOutcomeCaptureMiddleware + WebChannelEvent.failure). This reduces tinyhumansai#4413 to just the frontend that consumes tinyhumansai#4445's `failure` field off the tool_result socket event, rebased cleanly onto current main. Kept (21 files): - chat timeline render: ProcessingTranscriptView (+test), ChatRuntimeProvider, chatService, chatRuntimeSlice (+toolFailure test), turnState. - i18n: conversations.toolFailure.* keys across all 14 locales (3-way unioned onto current main so no existing translations are reverted). Dropped (now owned by tinyhumansai#4445): tool_status/, agent progress/observability, middleware, event_bus, socketio failure plumbing, and the deleted-file conflict on agent_tool_exec.rs. Verified: pnpm typecheck clean, i18n:check parity 0 missing/0 extra, 13 frontend tests pass (chatRuntimeSlice.toolFailure + ProcessingTranscriptView). Stacked on tinyhumansai#4445 (merge that first — it emits the field this renders).
The tool-failure feature added AgentProgress::ToolCallCompleted.failure; an integration test (tests/, a separate compilation unit not covered by `cargo check --lib`) constructed the event without it, failing the Rust Core Coverage job (E0063). `cargo check --tests` is now clean.
…om, terminal-inference halt, autonomous iter-cap lift (tinyhumansai#4463) Restores loop-guard behaviors dropped with the legacy `tool_loop.rs` in the TinyAgents migration; the crate `no_progress` ladder tracks *failures* only and does not replace them. All fixes are seam-side (`src/openhuman/**`); vendored crate untouched. - RepeatProgressMiddleware (new, `after_model` + batch-aware `after_tool`): halts on 3× identical *successful* `(tool, args)` batches (tinyhumansai#4088) and 4× identical outputs (tinyhumansai#4095); `wait_subagent` polling exemption ported. Shares the halt-summary slot + steering handle with the failure breaker. - RepeatedToolFailureMiddleware: differentiated headroom restored — recoverable classes (timeout/transient markers + tinyhumansai#4445 classifier: Timeout/ ServiceUnavailable/ModelConnection) get 8 identical / 12 varied instead of the crate's fixed 3/6; POLICY_DENIED_MARKER rejoins POLICY_BLOCKED_MARKER for the 2-repeat hard-reject fast-trip (tinyhumansai#4463 part 6). - Terminal delegated-inference fast-halt (tinyhumansai#3104): budget-exhausted / provider-config-rejection delegation failures halt on first occurrence, gated on the delegated-inference envelope so recoverable tool stderr isn't caught. - Autonomous iter-cap lift (part 7): `autonomous_iter_cap()` was a dead knob — wired back in via `subagent_iter_cap_with_autonomous_lift` at the subagent `effective_max_iterations()` sites, so task-dispatcher / skill subagents run up to TASK_RUN_MAX_ITERATIONS again instead of the normal per-agent cap. - Updated the adapter-inventory middleware-count assertions (15→16, 11→12). Tests deferred to the final parity test pass. Claude-Session: https://claude.ai/code/session_019j5TLsRLHsM3kqAYFyH4hR
Summary
Rework of #4413's backend onto post-#4399
main. Classifies a failed tool call into a user-facing explanation (plain-language cause + what to do next) and surfaces it on thetool_resultsocket event, so the chat "View processing" timeline can render "why / next" instead of a raw error.Why a rework
#4399 (the TinyAgents harness migration) deleted the original classification hook (
session/agent_tool_exec.rs) and theProgressReportertrait, and moved the tool loop onto thetinyagentscrate. So the hook is re-homed on the adapter seam.Changes
src/openhuman/tool_status/—classify(error_text, timed_out)+describe(class)→ClassifiedFailure(class, category, cause, next action). Pure, 16 unit tests.failurefield onAgentProgress::ToolCallCompletedandWebChannelEvent.AgentEvent::ToolCompletedcarries onlycall_id+tool_name, so the bridge previously hardcodedsuccess: true— a failed tool never surfaced.ToolOutcomeCaptureMiddleware::after_toolnow classifies each result and writes acall_id → (success, failure)side-channel (mirroring the existingToolNameMap); the bridge reads it to emit realsuccess+failure(fallback(true, None)if absent).progress_bridgeserializesfailureonto the socket event + run ledger.Scope note — backend only
The frontend consumer (timeline render) + i18n keys are a companion. They rebase cleanly onto current
mainfrom #4413 (only the Rust hook conflicted with #4399); taking the whole i18n files here would revert unrelated locale drift onmain, so they're intentionally excluded. See #4413.Validation
tool_statusunit tests pass; lib + test build green (with the fix(orchestration): repair TinyAgents-migration build break (main does not compile) #4442 build-fix applied).success: false+ afailureobject on thetool_resultevent. The middleware-before-event ordering that fills the side-channel is crate-internal; the fallback keeps behavior no worse than today if that assumption is ever wrong.Related
Summary by CodeRabbit
New Features
Bug Fixes